3.1 Statistical description
Central tendency of continuous single variables
Typical value of work hours (ratio variable)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 29.00 38.00 36.48 44.00 110.00
Typical value of age (ratio variable)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 30.00 41.00 40.45 50.00 64.00
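Age has no clear outliers. For work hours, the maximum of 110 lies far above the third quartile of 44, so a few very long working weeks may need checking.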
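As a minimal sketch of how such a summary and a simple outlier check might be produced in base R: the vector `work_hours` below is hypothetical toy data, not the survey data, and the 1.5 × IQR rule is one common convention rather than the method used in this report.

```r
# Hypothetical toy data; the report's real data are not reproduced here.
work_hours <- c(1, 20, 29, 35, 38, 40, 44, 48, 60, 110)

# Six-number summary: Min, 1st Qu., Median, Mean, 3rd Qu., Max
hours_summary <- summary(work_hours)
print(hours_summary)

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q <- quantile(work_hours, c(0.25, 0.75))
fence <- 1.5 * IQR(work_hours)
flagged <- work_hours[work_hours < q[[1]] - fence | work_hours > q[[2]] + fence]
print(flagged)
```

A boxplot of the same vector would show the flagged values as individual points.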
Central tendency by group (bivariate description)
The difference in work hours between men and women
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## gender `mean(workHour)`
## <fct> <dbl>
## 1 Male 41.0
## 2 Female 32.3
Use cross-tabulation to explore the relationship between gender and the other independent variables that may indirectly influence work hours.
##
## Male Female
## Agriculture, forestry and fishing 0.7500000 0.2500000
## Manufacturing 0.7156627 0.2843373
## Energy and water supply 0.7285714 0.2714286
## Construction 0.8603352 0.1396648
## Distribution, hotels and restaurants 0.4474474 0.5525526
## Transport and communication 0.7816092 0.2183908
## Banking and finance 0.5027027 0.4972973
## Public admin, education and health 0.2863777 0.7136223
## Other services 0.5255102 0.4744898
##
## Male Female
## Degree or higher 0.4592094 0.5407906
## Higher education 0.5185185 0.4814815
## A level or equivalent 0.4309463 0.5690537
## Secondary 0.5129108 0.4870892
## Other 0.5531915 0.4468085
##
## Male Female
## Managers 0.6126984 0.3873016
## PProfessionals 0.4523507 0.5476493
## Assoc. professionals 0.5311871 0.4688129
## Administrative 0.2359813 0.7640187
## Skilled Trade 0.8941980 0.1058020
## Caring & Leisure 0.1549708 0.8450292
## Sales & cust services 0.3411371 0.6588629
## Machine Operatives 0.8652174 0.1347826
## Elementary occupations 0.4822335 0.5177665
The gender proportions differ clearly across industries and major groups, but differ little across education levels.
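Row proportions like those in the tables above can be obtained with `table()` and `prop.table()`. The sketch below uses a hypothetical toy data frame called `survey`. Only the function calls are meant to carry over, not the values.

```r
# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  industry = c("Construction", "Construction", "Construction", "Construction",
               "Other services", "Other services"),
  gender = factor(c("Male", "Male", "Male", "Female", "Male", "Female"),
                  levels = c("Male", "Female"))
)

# Counts of gender within each industry
counts <- table(survey$industry, survey$gender)

# margin = 1 divides each row by its row total, so every row sums to 1
row_props <- prop.table(counts, margin = 1)
print(row_props)
```

Using `margin = 1` matches the layout above, where each industry row gives the share of men and women within that industry.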
3.3 Examine each independent variable and its sampling error
This section studies in depth how each candidate factor relates to work hours, in preparation for building a multiple linear regression model.
3.3.1 Work hour versus gender
| Gender | Mean work hours | Variance | Standard deviation |
|---|---|---|---|
| Male | 40.97442 | 163.3164 | 12.77953 |
| Female | 32.32708 | 187.8747 | 13.70674 |

Mean work hours differ between the genders. Women's work hours vary more and their distribution is less compressed, which may be because a large number of women do short-hours jobs (their lower quartile is very small). This unequal spread may cause pure heteroscedasticity.
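A minimal base R sketch of how a per-group mean, variance and standard deviation might be computed. The data frame `survey` holds hypothetical toy values. The column names `gender` and `workHour` follow the output in this report.

```r
# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  gender = factor(c("Male", "Male", "Male", "Female", "Female", "Female"),
                  levels = c("Male", "Female")),
  workHour = c(40, 44, 36, 20, 30, 40)
)

# Split work hours by gender, then compute three statistics per group
group_stats <- t(sapply(
  split(survey$workHour, survey$gender),
  function(h) c(mean = mean(h), variance = var(h), sd = sd(h))
))
print(group_stats)
```

Comparing the `variance` column across rows is the informal check for unequal spread used throughout this section.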
Examine sampling error
Null hypothesis (H0): the average work hours of men and women in the population are equal.
Alternative hypothesis (H1): the average work hours of men and women in the population are not equal.
Test statistic: the t-value, which tests whether the difference in means could be zero in the population. Reject the null hypothesis if the probability (p-value) is equal to or less than 0.05.
##
## Welch Two Sample t-test
##
## data: workHour by gender
## t = 19.548, df = 3582.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.780025 9.514657
## sample estimates:
## mean in group Male mean in group Female
## 40.97442 32.32708
The t-value is 19.548 and the p-value is below 0.05. If the null hypothesis were true, the probability of observing a t-value as or more extreme than 19.548 would be less than 2.2e-16. We reject the null hypothesis and conclude that the average work hours of men differ statistically significantly from those of women.
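The test above can be sketched as follows. The `survey` data frame is hypothetical toy data. `t.test()` applies the Welch (unequal variances) form by default, and the manual calculation shows where the t-value comes from.

```r
# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 5),
                  levels = c("Male", "Female")),
  workHour = c(40, 44, 36, 42, 38, 20, 30, 40, 25, 35)
)

# Welch two-sample t-test (var.equal = FALSE is the default)
res <- t.test(workHour ~ gender, data = survey)
print(res)

# Manual Welch t-value: difference in means over its standard error
groups <- split(survey$workHour, survey$gender)
m <- sapply(groups, mean)
v <- sapply(groups, var)
n <- lengths(groups)
t_manual <- (m[["Male"]] - m[["Female"]]) /
  sqrt(v[["Male"]] / n[["Male"]] + v[["Female"]] / n[["Female"]])
print(t_manual)
```

The Welch form is a reasonable default here because the group variances are not assumed to be equal.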
3.3.2 Work hour versus highest qualification
| Highest qualification | Mean work hours | Variance | Standard deviation |
|---|---|---|---|
| Degree or higher | 38.89571 | 173.5430 | 13.17357 |
| Higher education | 36.21417 | 199.2557 | 14.11580 |
| A level or equivalent | 35.15857 | 198.8737 | 14.10226 |
| Secondary | 35.06103 | 205.4440 | 14.33332 |
| Other | 33.07801 | 187.8010 | 13.70405 |
Mean work hours vary with education level, although the differences among Higher education, A level or equivalent, and Secondary are very small. Work hours vary more for workers below degree level (variances of about 188 to 205, against 173.5 for Degree or higher), which may be because people with lower education levels start working earlier and take short-hours jobs, making the distribution less compressed, whereas people with a degree generally tend to hold stable full-time jobs. Because the spread of work hours differs across education levels, this may cause pure heteroscedasticity.
Examine sampling error
Null hypothesis (H0): the mean work hours are equal across the education groups.
Alternative hypothesis (H1): the mean work hours of at least one education group differ from those of the others.
Test statistic: the F-value, which compares the variation among the groups with the variation within the groups. If the variation among the groups is large relative to the variation within them, the grouping variable matters.
## Df Sum Sq Mean Sq F value Pr(>F)
## education 4 11695 2923.8 15.26 2.25e-12 ***
## Residuals 3580 686153 191.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the F-value and its p-value, the null hypothesis can be rejected. Education level does explain a statistically significant portion of the variation in work hours.
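A minimal sketch of a one-way ANOVA of this kind, assuming a hypothetical toy data frame `survey` with three education groups. The F-value is the ratio of the between-group mean square to the within-group mean square.

```r
# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  education = factor(rep(c("Degree", "Secondary", "Other"), each = 3)),
  workHour = c(40, 42, 38, 35, 37, 33, 30, 28, 32)
)

# One-way ANOVA: does mean work time differ across education groups?
fit <- aov(workHour ~ education, data = survey)
anova_table <- summary(fit)[[1]]
print(anova_table)

# First row is the grouping variable, second row the residuals
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# Same F-value by hand: mean square between / mean square within
f_manual <- anova_table[["Mean Sq"]][1] / anova_table[["Mean Sq"]][2]
```

The null hypothesis would be rejected when `p_value` is at or below the chosen 0.05 threshold.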
3.3.3 Work hour versus industry
| Industry | Mean work hours | Variance | Standard deviation |
|---|---|---|---|
| Agriculture, forestry and fishing | 42.06250 | 210.0625 | 14.49353 |
| Manufacturing | 40.24578 | 139.9974 | 11.83205 |
| Energy and water supply | 40.77143 | 121.2224 | 11.01010 |
| Construction | 42.84358 | 170.4473 | 13.05555 |
| Distribution, hotels and restaurants | 31.59009 | 224.6362 | 14.98787 |
| Transport and communication | 41.91954 | 149.3461 | 12.22072 |
| Banking and finance | 35.94595 | 154.0623 | 12.41218 |
| Public admin, education and health | 35.85604 | 186.8716 | 13.67010 |
| Other services | 36.66497 | 198.7138 | 14.09659 |
Mean work hours vary with industry, although some industries have similar averages, such as Manufacturing and Energy and water supply. Work hours vary more in some industries, such as Agriculture, forestry and fishing; Distribution, hotels and restaurants; and Public admin, education and health, since working time in these kinds of jobs is flexible. The distribution of work hours is more compressed in other industries, such as Energy and water supply and Banking and finance, where people have more stable working time. So there may be pure heteroscedasticity.
Examine sampling error
Null hypothesis (H0): the mean work hours are equal across the industry groups.
Alternative hypothesis (H1): the mean work hours of at least one industry group differ from those of the others.
Test statistic: F-value
##
## Call:
## lm(formula = workHour ~ industry, data = list)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.844 -6.920 1.144 7.410 73.335
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 42.0625 3.3996 12.373
## industryManufacturing -1.8167 3.4646 -0.524
## industryEnergy and water supply -1.2911 3.7682 -0.343
## industryConstruction 0.7811 3.5483 0.220
## industryDistribution, hotels and restaurants -10.4724 3.4402 -3.044
## industryTransport and communication -0.1430 3.5525 -0.040
## industryBanking and finance -6.1166 3.5436 -1.726
## industryPublic admin, education and health -6.2065 3.4206 -1.814
## industryOther services -5.3975 3.4456 -1.567
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## industryManufacturing 0.60005
## industryEnergy and water supply 0.73190
## industryConstruction 0.82579
## industryDistribution, hotels and restaurants 0.00235 **
## industryTransport and communication 0.96790
## industryBanking and finance 0.08442 .
## industryPublic admin, education and health 0.06970 .
## industryOther services 0.11732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.6 on 3576 degrees of freedom
## Multiple R-squared: 0.0524, Adjusted R-squared: 0.05029
## F-statistic: 24.72 on 8 and 3576 DF, p-value: < 2.2e-16
According to the F-value and its p-value, the null hypothesis can be rejected. Industry does explain a statistically significant portion of the variation in work hours.
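When `lm()` is used with a categorical predictor, as in the output above, the intercept is the mean of the reference level and each coefficient is the difference from that mean. The overall F-statistic tests all groups at once. The sketch below illustrates this with a hypothetical toy data frame `survey`. The industry labels are borrowed from this report but the values are invented.

```r
# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  industry = factor(rep(c("Agriculture", "Construction", "Other services"),
                        each = 3)),
  workHour = c(42, 44, 40, 43, 45, 41, 30, 34, 32)
)

fit <- lm(workHour ~ industry, data = survey)

# Intercept = mean of the reference level ("Agriculture");
# other coefficients = group mean minus the reference mean
coefs <- coef(fit)
print(coefs)

# Overall F-test: named vector with elements value, numdf, dendf
fstat <- summary(fit)$fstatistic
p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
              lower.tail = FALSE)
print(p_value)
```

This is why individual coefficients can be non-significant against the reference level while the overall F-test still rejects equality of all group means.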
3.3.4 Work hours versus major group
| Major group | Mean work hours | Variance | Standard deviation |
|---|---|---|---|
| Managers | 42.52063 | 190.8109 | 13.81343 |
| PProfessionals | 39.87548 | 147.5621 | 12.14751 |
| Assoc. professionals | 38.36016 | 170.5817 | 13.06069 |
| Administrative | 32.77804 | 132.8569 | 11.52636 |
| Skilled Trade | 41.61775 | 154.2712 | 12.42060 |
| Caring & Leisure | 32.42398 | 199.1716 | 14.11282 |
| Sales & cust services | 29.86622 | 184.9820 | 13.60081 |
| Machine Operatives | 40.55652 | 163.3833 | 12.78215 |
| Elementary occupations | 28.81980 | 238.0056 | 15.42743 |

Mean work hours vary with the major group. Work hours vary more in some major groups, such as Caring & Leisure and Elementary occupations (which mainly require the use of hand-held tools and often some physical effort). Their mean work hours are also relatively short, and people in these groups are less likely to have stable working time. The smallest variation appears in Administrative, since this group often has fixed working hours. These differences in spread may cause pure heteroscedasticity.
Examine sampling error
Null hypothesis (H0): the mean work hours are equal across the major groups.
Alternative hypothesis (H1): the mean work hours of at least one major group differ from those of the others.
Test statistic: F-value
## Df Sum Sq Mean Sq F value Pr(>F)
## majorGroup 8 81571 10196 59.16 <2e-16 ***
## Residuals 3576 616277 172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the F-value and its p-value, the null hypothesis can be rejected. Major group does explain a statistically significant portion of the variation in work hours.
3.3.5 Work hours versus region
| Region | Mean work hours | Variance | Standard deviation |
|---|---|---|---|
| England | 36.73779 | 198.0847 | 14.07426 |
| Wales | 34.31977 | 145.8562 | 12.07709 |
| Scotland | 35.17692 | 199.1114 | 14.11069 |
| Northern Ireland | 35.58824 | 154.9575 | 12.44819 |

Mean work hours vary only slightly by region, from 34.32 in Wales to 36.74 in England.
Examine sampling error
Null hypothesis (H0): the mean work hours are equal across the regions.
Alternative hypothesis (H1): the mean work hours of at least one region differ from those of the others.
Test statistic: F-value
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 1528 509.3 2.619 0.0492 *
## Residuals 3581 696320 194.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the F-value and its p-value (0.0492, only just below 0.05), the null hypothesis can be rejected, so region does explain a statistically significant portion of the variation in work hours. However, 1528 / (1528 + 696320) ≈ 0.002, so region explains only about 0.2% of the variation in work hours. This is not a considerable amount, so the variable is excluded.
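The share of variation explained can be computed the same way for every ANOVA table in this section, which makes the comparison between variables explicit. The sketch below uses only the sums of squares printed in those tables. The vector names are illustrative.

```r
# Sums of squares copied from the ANOVA tables in this section
ss_between <- c(education = 11695, majorGroup = 81571,
                region = 1528, maritalStatus = 8139)
ss_residual <- c(education = 686153, majorGroup = 616277,
                 region = 696320, maritalStatus = 689710)

# Share of total variation explained by each grouping variable
explained <- ss_between / (ss_between + ss_residual)
print(round(explained, 4))

# Variable explaining the smallest share
weakest <- names(which.min(explained))
print(weakest)
```

Region comes out as the weakest of these four variables, which is consistent with the decision to exclude it.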
3.3.6 Work hours versus marital status
| Marital status | Mean work hours | Variance | Standard deviation |
|---|---|---|---|
| Single, never married | 33.95921 | 193.1076 | 13.89632 |
| Married/cohabitating | 37.48189 | 193.5925 | 13.91375 |
| Divorced/widowed | 36.23867 | 183.5944 | 13.54970 |

Mean work hours vary with marital status. The variances of the marital status groups are similar, so this variable is unlikely to cause pure heteroscedasticity.
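The comparison of variances here is made by eye. As an assumption on our part rather than a step in the original analysis, a formal check could use Bartlett's test of equal variances, sketched below on a hypothetical toy data frame `survey`.

```r
# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  maritalStatus = factor(rep(c("Single", "Married", "Divorced"), each = 3)),
  workHour = c(30, 35, 40, 31, 36, 42, 28, 33, 37)
)

# Bartlett's test: H0 is that all groups share the same variance
res <- bartlett.test(workHour ~ maritalStatus, data = survey)
print(res)

# A large p-value gives no evidence against equal variances
equal_variances_plausible <- res$p.value > 0.05
```

Bartlett's test is sensitive to non-normality, so it should be read as a rough guide alongside the variance table rather than as a definitive verdict.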
Examine sampling error
Null hypothesis (H0): the mean work hours are equal across the marital status groups.
Alternative hypothesis (H1): the mean work hours of at least one marital status group differ from those of the others.
Test statistic: F-value
## Df Sum Sq Mean Sq F value Pr(>F)
## maritalStatus 2 8139 4069 21.13 7.51e-10 ***
## Residuals 3582 689710 193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the F-value and its p-value, the null hypothesis can be rejected. Marital status does explain a statistically significant portion of the variation in work hours.
3.3.7 Work hour versus age

There is no clear pattern. Some people between 25 and 50 years old are outliers who work long hours. Because this age group is in better physical condition, they may be more likely to spend more time at work. Next, explore the relationship between age and work hours.
##
## Pearson's product-moment correlation
##
## data: list$workHour and list$age
## t = 3.6069, df = 3583, p-value = 0.0003141
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02746640 0.09270246
## sample estimates:
## cor
## 0.06014865
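The r-value is 0.06, so there is a very weak positive relationship between work hours and age. That is, there is an overall tendency for work hours to increase with age.
Examine sampling error
Null hypothesis (H0): the correlation coefficient between age and work hours in the population is equal to 0.
Alternative hypothesis (H1): the correlation coefficient is not equal to 0.
The t-value is 3.6 and the p-value (0.0003) is less than 0.05, so the null hypothesis is rejected.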
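The t-value of a Pearson correlation follows from r and the sample size as t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below first applies this to the r and degrees of freedom reported above, then runs `cor.test()` on a hypothetical toy data frame `survey` to show the call itself.

```r
# Values taken from the correlation output above
r_source <- 0.06014865
df_source <- 3583                      # n - 2
t_source <- r_source * sqrt(df_source) / sqrt(1 - r_source^2)
print(round(t_source, 4))

# Hypothetical toy data, not the survey used in this report
survey <- data.frame(
  age = c(20, 25, 30, 35, 40, 45, 50, 55),
  workHour = c(30, 38, 35, 42, 36, 44, 39, 41)
)

# Pearson correlation test (the default method)
res <- cor.test(survey$workHour, survey$age)
r_toy <- unname(res$estimate)
t_toy <- r_toy * sqrt(nrow(survey) - 2) / sqrt(1 - r_toy^2)
print(res)
```

With a large sample, even a correlation as small as 0.06 yields a significant t-value, so statistical significance here does not imply a strong relationship.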